Hone: "Scaling Down" Hadoop on Shared-Memory Systems

Authors

K. Ashwin Kumar

Jonathan Gluck

Amol Deshpande

Jimmy J. Lin

Abstract

The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether “typical” analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which generally operates over smaller and more refined datasets. To address these trends, we propose “scaling down” Hadoop to run on shared-memory machines. This paper presents a prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar and efficiently execute it, without modification, on a multi-core shared memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes. Our experiments show that Hone can be an order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on dataset sizes that fit into memory, Hone can outperform a fully-distributed 15-node Hadoop cluster in some cases as well.
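The idea of running a MapReduce-style job inside a single process, with all intermediate data held in shared memory, can be illustrated with a minimal sketch. This is not Hone's implementation and does not use the Hadoop API; the names `map_fn`, `reduce_fn`, and `run_in_memory` are hypothetical, and word count is chosen only because it is the customary MapReduce example. Python threads are used for brevity and do not give true CPU parallelism under the global interpreter lock, so the sketch shows the structure of the phases rather than the performance behavior reported in the abstract.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(line):
    # Hypothetical mapper: emit (word, 1) for every whitespace-separated token.
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    # Hypothetical reducer: sum all counts collected for one word.
    return word, sum(counts)


def run_in_memory(records, mapper, reducer, workers=4):
    """Run map, shuffle, and reduce in one process with no disk or network I/O."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: worker threads apply the mapper to each input record.
        mapped = list(pool.map(mapper, records))

        # Shuffle phase: group values by key in an ordinary in-memory dictionary
        # instead of writing and transferring intermediate files.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # Reduce phase: worker threads apply the reducer to each key group.
        reduced = pool.map(lambda kv: reducer(kv[0], kv[1]), groups.items())
        return dict(reduced)


if __name__ == "__main__":
    lines = ["hadoop on a cluster", "hadoop on one machine"]
    print(run_in_memory(lines, map_fn, reduce_fn))
```

The point of the sketch is that when the whole dataset fits in memory, the shuffle becomes a dictionary update rather than a distributed sort and transfer, which is the kind of overhead a shared-memory runtime can avoid.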

Related articles

Optimization Techniques for "Scaling Down" Hadoop on Multi-Core, Shared-Memory Systems

HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS’ sing...

Parallelization Strategies for Distributed Non Negative Matrix Factorization

Dimensionality reduction and clustering have been the subject of intense research efforts over the past few years [2]. They offer an approach of knowledge extraction from huge amounts of data. Although some of these techniques are effective at achieving lower data dimensions, very few focused on scaling the techniques to tackle data sets that might not fit into memory. Non negative matrix facto...

Parallel algorithms for clustering biological graphs on distributed and shared memory architectures

Graph algorithms on parallel architectures present an interesting case study for irregular applications. In this paper, we address one such irregular application — one of clustering real world graphs constructed out of biological data using parallel computers. While theoretical formulations of the clustering operation are either intractable or computationally prohibitive, efficient heuristics e...

A Distributed Phoenix++ Framework for Big Data Recommendation Systems

Recommendation systems are important big data applications that are used in many business sectors of the global economy. While many users utilize Hadoop-like MapReduce systems to implement recommendation systems, we utilize the high-performance shared-memory MapReduce system Phoenix++ to design a faster recommendation engine. In this paper, we design a distributed out-of-core recommendation algor...

Journal: PVLDB

Volume 6

Publication date: 2013
